Datasets used in backdoor research on CodeLMs

| Dataset | Year | Programming Language | Data Source | Download Link |
|---|---|---|---|---|
| BigCloneBench | 2014 | Java | GitHub | Download |
| OJ dataset | 2016 | C++ | OJ Platform | Download |
| CodeSearchNet | 2019 | Go, Java, JavaScript, PHP, Python, Ruby | GitHub | Download |
| Code2Seq | 2019 | Java | GitHub | Download |
| Devign | 2019 | Java | GitHub | Download |
| Google Code Jam (GCJ) | 2020 | C++, Java | OJ Platform | Download |
| CodeXGLUE | 2021 | Go, Java, JavaScript, PHP, Python, Ruby | GitHub | Download |
| CodeQA | 2021 | Java, Python | GitHub | Download |
| APPS | 2021 | Python | OJ Platform | Download |
| Shellcode_IA32 | 2021 | assembly language instruction | OJ Platform | Download |
| SecurityEval | 2022 | Python | GitHub | Download |
| LLMSecEval | 2023 | Python, C | GitHub | Download |
| PoisonPy | 2023 | Python | GitHub | not yet published |

A summary of target models of backdoor attacks in CodeLMs

| Attack Technique | Year | Venue | Attack Type | Target Models | Target Tasks |
|---|---|---|---|---|---|
| Ramakrishnan et al. | 2020 | arXiv | Data poisoning | Code2Seq, Seq2Seq | Code summarization, Method name prediction |
| Schuster et al. | 2021 | USENIX Security | Data poisoning, Model poisoning | Pythia, GPT-2 | Code completion |
| Severi et al. | 2021 | USENIX Security | Data poisoning | LightGBM, EmberNN, Random Forest, Linear SVM | Malware classification |
| CodePoisoner | 2022 | arXiv | Data poisoning | LSTM, TextCNN, Transformer, CodeBERT | Code defect detection, Code clone detection, Code repair |
| Wan et al. | 2022 | ESEC/FSE | Data poisoning | BiRNN, Transformer, CodeBERT | Code search |
| BADCODE | 2023 | ACL | Data poisoning | CodeBERT, CodeT5 | Code search |
| Cotroneo et al. | 2023 | arXiv | Data poisoning | Seq2Seq, CodeBERT, CodeT5+ | Code generation |
| AFRAIDOOR | 2023 | arXiv | Data poisoning | CodeBERT, CodeT5, PLBART | Code summarization |
| PELICAN | 2023 | USENIX Security | Data poisoning | BiRNN-func, XDA-func, XDA-cell, StateFormer, EKLAVYA, EKLAVYA++, in-nomine, in-nomine++, S2V, S2V++, Trex, SAFE, SAFE++, S2V-B, S2V-B++ | Binary code analysis |
| Li et al. | 2023 | ACL | Model poisoning | PLBART, CodeT5 | Code defect detection, Code clone prediction, Code2Code translation, Text2Code translation, Code refinement |
| BadCS | 2023 | arXiv | Model poisoning | BiRNN, Transformer, CodeBERT, GraphCodeBERT | Code search |